Developing Data Scientists: Exploring Free Code Camp’s “2016 New Coder Survey”

Structure of Dataset

The original “2016 New Coder Survey” dataset consists of 113 variables. Most of these variables are answers to survey questions, though a few are computer-generated (e.g. respondent ID and survey start/end times). Over 15,000 observations (i.e. respondents) exist.

The str function output is long and messy, so I won’t print it here. Please consult Free Code Camp’s list of survey questions and possible answers instead. Most of the variables are Boolean, numeric, or categorical.

New Variables

I created six new variables from existing variables:

  • ContinentCitizen and ContinentLive from CountryCitizen and CountryLive using Vincent Arel-Bundock’s countrycode R package
  • PodcastPartiallyDerivative, PodcastBecomingDataSci, and PodcastTalkingMachines from PodcastOther using ifelse statements
  • HoursLearningBucket using the cut function on HoursLearning

These new variables bring our total to 119 variables.

## [1] 15620   119
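As a rough sketch of how the ifelse and cut steps could look, here is a toy example in base R. The data values are invented, the pattern matching with grepl() is an assumption about how the free-text podcast field was searched, and the break points are inferred from the bucket labels that appear later in this report.

```r
# Toy stand-in for the survey data; column names follow the text,
# the values are invented for illustration.
survey <- data.frame(
  PodcastOther  = c("Partially Derivative", "Talking Machines, others", NA, "none"),
  HoursLearning = c(5, 15, 40, 60),
  stringsAsFactors = FALSE
)

# Flag respondents who typed a given podcast into the free-text field.
# grepl() returns FALSE for NA, so missing answers become 0.
survey$PodcastPartiallyDerivative <- ifelse(
  grepl("partially derivative", survey$PodcastOther, ignore.case = TRUE), 1, 0
)
survey$PodcastTalkingMachines <- ifelse(
  grepl("talking machines", survey$PodcastOther, ignore.case = TRUE), 1, 0
)

# Bucket weekly learning hours; cut() builds right-closed intervals by default,
# so a respondent reporting exactly 0 hours would fall outside the first bucket.
survey$HoursLearningBucket <- cut(survey$HoursLearning, breaks = c(0, 10, 20, 40, 80))

levels(survey$HoursLearningBucket)
```

Because the intervals are right-closed, a value of exactly 40 hours lands in the (20,40] bucket rather than (40,80].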

Data Science/Engineering Subset

646 respondents answered “Data Scientist/Data Engineer” to the question: “Which one of these roles are you most interested in?”

## [1] 646 119

The following analysis first explores the characteristics of these developing data scientists/engineers, then dives deeper into the characteristics of new coders in general.

Additional comments are included where the results significantly differ from the full new coder survey dataset.

The univariate section intentionally mimics the structure of Free Code Camp’s Medium article for a direct comparison of data science/engineering students and new coders in general. A few additional univariate plots are included to smooth the transition to the plots explored in the bivariate and multivariate sections.


Univariate Plots

Who Participated

CodeNewbie and Free Code Camp designed the survey, and dozens of coding-related organizations publicized it to their members.

Of the 646 developing data scientists and data engineers who responded to the survey:

A quarter are women.

##    female 
## 0.2447917

Their median age is 26.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   14.00   22.00   26.00   27.72   31.25   65.00      74

They started programming an average of 16 months ago.

This average is 5 months longer than the full survey dataset.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    3.00    8.00   16.17   20.00  360.00      31

The median programming experience of 8 months is much easier to see after logarithmically transforming the long-tailed data.
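A minimal sketch of such a transformation, using invented values. The +1 offset is an assumption made here so that respondents with 0 months of experience stay on the axis; the original plot may have handled zeros differently.

```r
# Hypothetical months-of-experience values with a long right tail
months <- c(0, 1, 3, 8, 20, 60, 360)

# log10(x + 1) keeps respondents with 0 months; plain log10(0) would be -Inf
log_months <- log10(months + 1)

# The ordering of respondents is preserved, but the tail is compressed
range(months)
range(log_months)
```

On the raw scale the 360-month respondent stretches the axis so far that the bulk of the data is squeezed near zero; on the log scale the same respondent sits only a few units away from the rest.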

Learner Goals and Approaches

The average respondent dedicates 14 hours per week to learning.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   10.00   14.41   20.00   80.00      30

No respondents want to freelance or start their own business.

Compared to 40% for the full new coder survey, this is a bit shocking. I understand the demand for data scientists and engineers in industry, but I have a hunch these zero counts are caused by the survey’s design. Every respondent that answered the job role of interest question has zero counts for “start your own business” and “freelance.”

52% are already applying for jobs, or will start applying within the next year.

The data-related subset has a longer time horizon than the full survey dataset, where 65% are applying within the next year.

Most of them want to work in an office, as opposed to remotely.

And a majority are willing to relocate.

Most of them have not yet attended any in-person coding events.

On average, they use at least three different resources for learning.

The developing data scientists/engineers use Coursera, edX, and Udacity more frequently than new coders in general. These companies cover a wider range of subject areas than some of the coding-specific resources listed.

Only 1% have attended a bootcamp.

6% of new coders from the full survey dataset have attended a bootcamp.

Demographics and Socioeconomics

Data-focused respondents represent 166 countries.

More than 90% are from North America, Europe, and Asia.

The dominating percentage of North Americans should be expected because Free Code Camp is based in the United States.

Their cities span a wide range of urbanization levels.

Just under a quarter of respondents are ethnic minorities in their country.

And nearly half are non-native English speakers. They grew up speaking one of 148 languages.

67% have earned at least a bachelor’s degree.

Compared to 58% for the full new coder survey, the data-focused subset is more skewed towards post-secondary studies.

Just over one-half are currently working.

Two-thirds of new coders, in general, are currently working.

A quarter work in the tech industry.

Employment fields are more spread out than in the full new coder survey, where 50% of respondents work in software development and IT.

Median current salary is $44k.

The median current salary for the full dataset is $37k.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   25000   43600   48420   60000  200000     390

And they expect to earn a median of $60k with their new data science/engineering skills.

The median for the full survey dataset is $50k. With data science/engineering being notoriously lucrative in 2016, some respondents might be seeking higher wages.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   40000   60000   61110   80000  200000      65

7% have served in their country’s military.

## has served in military 
##             0.06501548

13% have children, and another 3% financially support an elderly or disabled relative. And one-fifth are doing this without the help of a spouse.

## has children 
##    0.1346749
## financially supporting 
##             0.03250774
## no spouse 
## 0.2137405

47% consider themselves underemployed (working a job that is below their education level).

## is underemployed 
##        0.4705882

If they have a home mortgage, they owe an average of $194k.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   76000  150000  194400  240000 1000000     591

If they have student loans, they owe an average of $37k.

This average is $3k more than the full survey dataset.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   10000   20000   36880   45000 1000000     485

Removing the million dollar outlier, the distribution is much clearer with the majority of debt under $75k. I hope that outlier is a joke.

14% don’t yet have high-speed internet at home.

## has high-speed internet 
##               0.8573913

And 3% are currently receiving disability benefits from their government.

## is receiving disability benefits 
##                       0.02608696

Univariate Analysis

What is/are the main feature(s) of interest in your dataset?

There isn’t really a singular main feature of interest in the “2016 New Coder Survey” dataset. There are several smaller features, but nothing stands out like diamond price and its relationship to carat weight, cut, colour, etc. in the R diamonds dataset, for example. The diamonds dataset covers two time periods (the existence of the diamond pre-sale and post-sale), whereas the survey dataset only covers a single period (the early stages of an individual’s coding career).

If we could fast-forward several years and survey the same respondents, the main feature of interest might be career earnings (adjusted for cost of living, preferably) and/or self-reported career satisfaction. A predictive model using a combination of variables from the 2016 survey could then be built to estimate career success.

If the survey asked “Are you already working as a data scientist/engineer?” instead of “Are you already working as a software developer?”, the current income variable might be a main feature of interest. Unfortunately, the answer to that question cannot be extracted from the existing variables.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Though there isn’t a main feature of interest, we can separate the respondents who did not answer “Data Scientist/Data Engineer” to the job role interest question (as we already have for those who did) and compare the two subsets using bivariate and multivariate plots.

I will also explore two smaller features, hours dedicated to learning per week and expected next salary, using bivariate and multivariate plots.

Of the features you investigated, were there any unusual distributions?

There is a lot of long-tailed data. Most variables did not require transformation to view the details of the distribution. Programming experience is strongly positively skewed, however, and required a log transformation to visually compare those with 3 months of experience to those with 25 years.

That no respondents want to freelance or start their own business seems strange. Perhaps a survey design choice caused these zero counts.

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The following operations were performed to tidy, adjust, or change the form of the data:

  • Each code event, resource, and podcast is represented by a boolean variable. I summed the number of yeses for each, which created a single row of sums. I used tidyr’s gather() to transform the data from a wide format to a long format. Then I transformed the long data into factor format, using the replicate function with the number of yeses as the multiplier. This data is used to create the code event, resource, and podcast bar charts.
  • After subselecting all code event, resource, and podcast columns separately, I created a new boolean variable named answered, where 1 represents using at least one event/resource/podcast and 0 represents using none. The answered sum total is used in the “x out of 646 developing data scientists/engineers answered” label at the bottom of each bar chart.
  • I separated data-specific podcasts in the user-inputted PodcastOther category into their own boolean variables.
  • I changed “NA” in the EmploymentStatus variable to “other” if the respondent provided the user-inputted EmploymentStatusOther variable.
  • I changed “NA” in the EmploymentField variable to “other” if the respondent provided the user-inputted EmploymentFieldOther variable.
  • I separated the “Americas” continents outputted by countrycode() into North and South America.

The first five operations were performed so bar charts could be created, which wasn’t possible with the original data format. The Americas separation was performed for additional insight.
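The wide-to-long-to-factor reshaping and the answered flag can be sketched in base R as follows. This is an illustration under assumptions rather than the original code: the column names and values are invented, base R stands in for tidyr’s gather(), and rep() is assumed to be the replicate step.

```r
# Toy Boolean columns (1 = yes, NA = not selected); names are illustrative.
resources <- data.frame(
  ResourceCoursera = c(1, 1, NA, 1),
  ResourceEdX      = c(NA, 1, NA, NA),
  ResourceUdacity  = c(1, NA, NA, 1)
)

# Wide: a single row of yes-counts, one per column
yes_counts <- colSums(resources, na.rm = TRUE)

# Long: one row per resource with its count (the shape gather() would produce)
long <- data.frame(resource = names(yes_counts), n = as.vector(yes_counts))

# Factor: repeat each name once per yes, ready for a bar chart of counts
resource_factor <- factor(rep(long$resource, times = long$n))

# answered = 1 if the respondent selected at least one resource, else 0
answered <- as.integer(rowSums(resources, na.rm = TRUE) > 0)

table(resource_factor)
sum(answered)
```

The sum of answered is the "x" that would go into an "x out of N answered" chart label.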


Bivariate Plots

14974 respondents did not answer “Data Scientist/Data Engineer” to the question: “Which one of these roles are you most interested in?”

## [1] 14974   119

SPLOMs

The next two plots are created using pairs.panels() from the psych package. Each displays a scatter plot matrix (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and Pearson correlations above the diagonal.
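The correlation coefficients shown in a SPLOM can also be computed directly. The sketch below uses base R’s cor() on invented values; handling missing answers with pairwise complete observations is an assumption, not necessarily what pairs.panels() was configured to do in the original analysis.

```r
# Invented numeric columns standing in for Age, Income and ExpectedEarning
df <- data.frame(
  Age             = c(22, 25, 31, 40, NA, 28),
  Income          = c(20000, 35000, 50000, 60000, 45000, NA),
  ExpectedEarning = c(40000, 50000, 60000, 65000, 55000, 50000)
)

# Pearson correlations; each cell uses every row where both variables are present
cors <- cor(df, use = "pairwise.complete.obs", method = "pearson")

round(cors, 2)
```

With many skipped questions in a survey, pairwise deletion keeps more respondents per cell than dropping every row that has any missing value.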

For the data science subset of the survey, all correlations are below 0.4, which supports my statement that no main feature exists. The strongest of the correlations are:

  • Age and Income (0.30)
  • Income and ExpectedEarning (0.36)
  • Income and StudentDebtOwe (0.34)

The phenomena revealed are intuitive, but not groundbreaking: you tend to make more money when you are older, you tend to expect your next job to have a high salary if your current one does, and higher student debt (a rough proxy for expensive schooling) tends to go with higher income levels.

For the non-data science subset of the survey, all correlations are again below 0.4. Most of the correlations are within 0.1 of the data science subset, except for three:

  • Age and StudentDebtOwe (0.24 - 0.10 = 0.14)
  • MonthsProgramming and StudentDebtOwe (-0.07 - 0.09 = -0.16)
  • Income and StudentDebtOwe (0.34 - 0.08 = 0.26)

Interesting. Student debt levels are involved in all three correlations. I bet the aforementioned skew towards post-secondary studies for the data science subset plays a role here, where higher levels of student debt are expected.

Let’s zoom in on the strong age-income correlation, this time for the full survey dataset. Note that the strength exists despite the majority of $200k salaries belonging to respondents under 40.

The earnings vs. age trend, however, isn’t maintained as these individuals prepare to transition to their new field of choice. Younger individuals appear willing to capitalize on lucrative tech salaries and older individuals appear willing to take a pay cut.

Gender and Citizenship

Let’s use the full new coder survey for the rest of the analysis. We’ll explore hours dedicated to learning per week and expected next salary. These are variables dependent upon the quality of coding resources, whereas the other numerical ones (e.g. age, income, and programming experience) are set previously.

For the following boxplots, the horizontal line is the median and the “x” is the mean. The top of the box is the third quartile and the bottom is the first quartile. Each whisker extends to the most extreme data point within 1.5 times the interquartile range of the box.
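To make the boxplot anatomy concrete, here is a small sketch that computes each element by hand on invented values. It assumes the common convention that whiskers stop at the most extreme observation lying within 1.5 interquartile ranges of the box, and it uses R’s default quantile definition.

```r
# Invented weekly learning hours, including one extreme value
hours <- c(2, 5, 8, 10, 10, 12, 15, 20, 25, 80)

q   <- quantile(hours, c(0.25, 0.5, 0.75))  # box bottom, median line, box top
iqr <- q[3] - q[1]

# Whiskers end at the most extreme observations inside the 1.5 * IQR fences
lower_whisker <- min(hours[hours >= q[1] - 1.5 * iqr])
upper_whisker <- max(hours[hours <= q[3] + 1.5 * iqr])

# Anything beyond the whiskers would be drawn as an individual point
outliers <- hours[hours < lower_whisker | hours > upper_whisker]

mean(hours)  # the "x" marker on the plots
```

In this toy sample the 80-hour value falls outside the upper fence, so it pulls the mean above the median without lengthening the whisker.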

Hours dedicated to learning results are nearly identical across genders. Do trans new coders spend more time learning? A small sample size issue exists, but I wouldn’t be shocked if a true effect is present here.

## 
##        male      female genderqueer     agender       trans 
##       10766        2840          66          38          36


There is not much differentiation across continents either. All have a median of 10 hours dedicated to learning per week. Asian and African students have the highest means, at 16.4 and 16.8 hours, respectively.

## 
## North America        Europe          Asia South America        Africa 
##          6744          3358          2178           567           506 
##       Oceania 
##           301


Females actually expect higher salaries than males, with a $9k gap in medians and a $4k gap in means. There is a huge gap in first quartiles, where the 25th percentile female expects $14k more than her male equivalent. As with hours dedicated to learning, transgender new coders have relatively higher expected salaries. Did a particularly ambitious set of trans individuals respond to the survey or are these their true traits?

## Gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   30000   50000   52620   70000  200000    6763 
## -------------------------------------------------------- 
## Gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   43650   59000   56620   70000  200000    1532 
## -------------------------------------------------------- 
## Gender: genderqueer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    7000   50000   60000   66970   70000  200000      37 
## -------------------------------------------------------- 
## Gender: agender
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   24000   36000   46500   58220   67500  200000      20 
## -------------------------------------------------------- 
## Gender: trans
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   20000   44250   67500   67230   76250  200000      20
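Grouped summaries in this layout are what base R’s by() prints. The sketch below reproduces the idea on invented data; the column names mirror the survey variables, but the values are made up and this is not necessarily the exact call used for the output above.

```r
# Invented respondents; NA marks a skipped salary question
df <- data.frame(
  Gender          = c("male", "male", "female", "female", "female"),
  ExpectedEarning = c(30000, 50000, NA, 60000, 40000),
  stringsAsFactors = FALSE
)

# One summary() block per gender, separated by dashed lines when printed
by(df$ExpectedEarning, df$Gender, summary)

# The same grouping, reduced to a single statistic per group
medians <- tapply(df$ExpectedEarning, df$Gender, median, na.rm = TRUE)
medians
```

The tapply() form is handy when only the gap between group medians or means is needed.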


Whoa. Expected earning by continent varies far more than in the three boxplots above. North Americans expect the highest range of salaries, with their interquartile range spanning from $50k to $70k. Europe’s 75th percentile is North America’s 25th percentile. I wonder if some European respondents forgot to convert from pounds or euros to US dollars. Expectations in Asia are all over the map.

A lot of these individuals are using similar, if not the same, online educational resources. Labour market economics are cruel.

## ContinentCitizen: North America
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   50000   60000   61820   70000  200000    3700 
## -------------------------------------------------------- 
## ContinentCitizen: Europe
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   15000   30000   36010   50000  200000    2178 
## -------------------------------------------------------- 
## ContinentCitizen: Asia
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   18000   48000   51050   70000  200000    1470 
## -------------------------------------------------------- 
## ContinentCitizen: South America
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   18000   36000   40300   60000  200000     414 
## -------------------------------------------------------- 
## ContinentCitizen: Africa
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   20000   45000   52290   70000  200000     361 
## -------------------------------------------------------- 
## ContinentCitizen: Oceania
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10000   40000   50000   54810   60000  200000     178


The median respondent that dedicates 40+ hours per week expects $10k more than the median respondents from the other brackets.

## 
##  (0,10] (10,20] (20,40] (40,80] 
##    8175    3564    2394     636
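The bucket counts above can be turned into shares of respondents with a one-line calculation. This sketch simply reuses the four counts printed in the table.

```r
# Respondents per weekly-hours bucket, copied from the table above
counts <- c("(0,10]" = 8175, "(10,20]" = 3564, "(20,40]" = 2394, "(40,80]" = 636)

# Convert counts to proportions of all bucketed respondents
shares <- prop.table(counts)

round(100 * shares, 1)
```

The last bucket works out to roughly 4.3% of the bucketed respondents.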


Let’s dig into that 40-80 hour bracket. Less than 5% of respondents are dedicating 40+ hours to learning each week. Below are the most common ages (the top row is age and the bottom row is number of respondents) and educational backgrounds for this bracket.

## 
## 25 21 26 23 24 20 22 32 30 27 
## 40 38 36 35 33 29 29 28 26 24
## 
##                       bachelor's degree 
##                                     249 
##          some college credit, no degree 
##                                      94 
## high school diploma or equivalent (GED) 
##                                      65 
##      master's degree (non-professional) 
##                                      43 
##                        some high school 
##                                      28 
## professional degree (MBA, MD, JD, etc.) 
##                                      24


The most common ages are in the early to mid-twenties, and a bachelor’s degree is by far the most common educational background. It appears that these respondents are forgoing traditional forms of higher education like master’s and professional degrees and using those 40+ hour weeks to learn to code.

This is the exact situation I’m in with my personalized data science master’s degree. The quality and affordability of online education in 2016 is incredible, though many still aren’t aware of the existence of resources like Free Code Camp, Udacity, and Coursera. If this survey were repeated in a few years, I would expect more respondents to be in the higher brackets.

Job Roles of Interest

Again, the most common job roles of interest are:

## 
##         Full-Stack Web Developer          Front-End Web Developer 
##                             2571                             1379 
##           Back-End Web Developer   Data Scientist / Data Engineer 
##                              704                              646 
##                 Mobile Developer         User Experience Designer 
##                              414                              275 
##                DevOps / SysAdmin                  Product Manager 
##                              219                              191 
##       Quality Assurance Engineer 
##                              104


User Experience Designer is by far the most diverse discipline in terms of gender, with about the same number of males as females and the highest percentage of agender, genderqueer, and trans respondents. Mobile development is the most male-dominated discipline at nearly 80%, though full-stack and back-end development are close.

The highest relative popularity for North America (read: biggest purple bar segment) is user experience design. Europe’s is back-end development. Asia’s, South America’s, and Africa’s is mobile development. Oceania’s is data science/engineering. Mobile developer is the most diverse discipline in terms of citizenship.

The skew towards post-secondary studies for data science and data engineering is much clearer here. Mobile development has the highest percentage of respondents with no, some, or only a high school education. This skew will surely reflect itself in the subsequent age boxplot.

Mobile developers are indeed the youngest with a first quartile of 20 years, two years younger than the next youngest discipline. The remaining disciplines are fairly close in age, with front-end development being the oldest with a mean age of 29 years.

## JobRoleInterest: Full-Stack Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   11.00   23.00   27.00   28.94   33.00   70.00     294 
## -------------------------------------------------------- 
## JobRoleInterest: Front-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   12.00   24.00   27.00   29.08   33.00   64.00     193 
## -------------------------------------------------------- 
## JobRoleInterest: Back-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13.00   22.00   27.00   28.03   32.00   59.00     103 
## -------------------------------------------------------- 
## JobRoleInterest: Data Scientist / Engineer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   14.00   22.00   26.00   27.72   31.25   65.00      74 
## -------------------------------------------------------- 
## JobRoleInterest: Mobile Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    12.0    20.0    24.0    26.2    31.0    54.0      77 
## -------------------------------------------------------- 
## JobRoleInterest: UX Designer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   12.00   22.00   26.00   28.74   32.00   73.00      38


Data scientists-, data engineers-, and back-end developers-in-training have programmed the longest, with a median experience of 8 months (the summaries below are in years, so 0.6667 is 8 months). UX designers have the lowest first quartile at two months of programming experience, one month below the next lowest discipline.

## JobRoleInterest: Full-Stack Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.250   0.500   1.043   1.000  40.830      88 
## -------------------------------------------------------- 
## JobRoleInterest: Front-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.2500  0.5000  0.7917  1.0000 15.0000      43 
## -------------------------------------------------------- 
## JobRoleInterest: Back-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.3333  0.6667  1.2680  1.6250 20.0000      33 
## -------------------------------------------------------- 
## JobRoleInterest: Data Scientist / Engineer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.2500  0.6667  1.3470  1.6670 30.0000      31 
## -------------------------------------------------------- 
## JobRoleInterest: Mobile Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.250   0.500   1.049   1.250  13.330      15 
## -------------------------------------------------------- 
## JobRoleInterest: UX Designer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1667  0.5000  1.0610  1.0000 36.0000      20


Full-stack developers dedicate the most time to learning each week, with 25% of respondents dedicating 30+ hours weekly. UX designers spend the least amount of time learning per week with a mean of 12 hours per week.

## JobRoleInterest: Full-Stack Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   10.00   15.00   19.94   30.00  100.00     108 
## -------------------------------------------------------- 
## JobRoleInterest: Front-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     6.0    12.0    16.7    20.0   100.0      48 
## -------------------------------------------------------- 
## JobRoleInterest: Back-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    8.00   15.00   18.77   25.00  100.00      40 
## -------------------------------------------------------- 
## JobRoleInterest: Data Scientist / Engineer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   10.00   14.41   20.00   80.00      30 
## -------------------------------------------------------- 
## JobRoleInterest: Mobile Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   12.00   17.76   25.00  100.00      21 
## -------------------------------------------------------- 
## JobRoleInterest: UX Designer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   10.00   12.04   15.00   63.00      19


Respondents interested in data science and/or engineering clearly have the highest current salaries. Their third quartile of $60k per year is $8k higher than the next highest discipline. There isn’t much income differentiation between the remaining job roles of interest.

## JobRoleInterest: Full-Stack Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   20000   35000   41010   52000  200000    1508 
## -------------------------------------------------------- 
## JobRoleInterest: Front-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   20000   35000   37020   48000  200000     806 
## -------------------------------------------------------- 
## JobRoleInterest: Back-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   17750   32000   36990   49250  200000     436 
## -------------------------------------------------------- 
## JobRoleInterest: Data Scientist / Engineer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   25000   43600   48420   60000  200000     390 
## -------------------------------------------------------- 
## JobRoleInterest: Mobile Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   20000   33800   36420   46500  155000     286 
## -------------------------------------------------------- 
## JobRoleInterest: UX Designer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   20000   31500   35730   50000   90000     175


Respondents interested in data science/engineering expect to earn the most at their next job. Given the aforementioned correlation between current salaries and expected salaries, this is not a surprise. Note that expected salaries are higher than current salaries (see the previous boxplot) across the board.

## JobRoleInterest: Full-Stack Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   40000   55000   54670   70000  200000     225 
## -------------------------------------------------------- 
## JobRoleInterest: Front-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   30000   50000   48070   60000  200000     118 
## -------------------------------------------------------- 
## JobRoleInterest: Back-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   30000   50000   50060   65000  200000      73 
## -------------------------------------------------------- 
## JobRoleInterest: Data Scientist / Engineer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   40000   60000   61110   80000  200000      65 
## -------------------------------------------------------- 
## JobRoleInterest: Mobile Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   30000   50000   52740   70000  200000      48 
## -------------------------------------------------------- 
## JobRoleInterest: UX Designer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   40000   51000   55100   70000  200000      40

Bivariate Analysis

How did the feature(s) of interest vary with other features in the dataset?

The data science/engineering subset of the survey is largely similar to the non-data science/engineering subset, except for three correlations involving student debt owed. The skew towards post-secondary studies for the data-focused subset is the likely culprit.

The correlation between age and current salary is stronger than the correlation between age and expected next salary (the salary at the respondent’s first job in their new field).

Hours dedicated to learning per week doesn’t appear to vary much with gender or continent, though sample size issues exist.

Expected salary for a respondent’s next job varies strongly by continent. Females also appear to have a much higher bottom line for expected salary than males. Those who dedicate more than 40 hours a week to learning expect higher salaries as well.

The majority of respondents for all job roles of interest are male, North American, and have bachelor’s degrees. Age, programming experience, hours dedicated to learning, current salary, and expected next salary all vary depending on job role of interest. One or two of the disciplines stand out from the pack for each of the five quantitative variables.

What was the strongest relationship you found?

No exceedingly strong relationship exists. All correlations are below 0.4.

Current salary and expected next salary have the strongest relationship for both subsets, with correlations of 0.36 and 0.38.

Of the features you investigated, were there any unusual distributions?

Europe’s 75th percentile for expected next salary is North America’s 25th percentile ($50k USD). Perhaps some European respondents forgot to convert from pounds or euros to US dollars.


Multivariate Plots

Let’s dig deeper into the strongest correlation: current salary versus expected next salary. Again, this new job is presumably where the respondent will put their new coding skills to use.

Plotting income vs. expected earning across genders, the first impression is that there are a lot of male data points. This abundance makes it hard to tell if the wage gap presents itself in this dataset. Looking at each gender’s presence above the $50k lines, my first instinct is that the gap exists. Males definitely have the highest proportion of $150k+ salaries. Stay tuned for the final plots section, where I’ll determine the presence of the gap definitively.

The same multivariate plot is generated below, but for the question “Are you an ethnic minority in your country?” Again, it is difficult to determine definitively if a wage gap is present. Non-ethnic minorities definitely have a higher proportion of $150k+ salaries, but a huge number of data points are clustered in the bottom left quadrant. It looks like minorities are better represented above the $50k expected salary line, but less so above the $50k current salary line. We’ll see for sure if this is indeed true in the final plots section.

Let’s combine all of the purple/pink boxplots from the bivariate plots section into one radar chart. The mean for each numerical variable normalized between 0 and 1 is plotted for each job role of interest.
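One way to read "normalized between 0 and 1" is min-max rescaling of the group means, sketched below with the mean ages and mean expected salaries reported in the summaries earlier in this report. Whether the chart rescales the means (as here) or averages already-rescaled values is an assumption; the helper name rescale01 and the shortened role labels are illustrative.

```r
roles <- c("Full-Stack", "Front-End", "Back-End", "Data Sci/Eng", "Mobile", "UX")

# Group means copied from the summaries earlier in this report
means <- data.frame(
  Age             = c(28.94, 29.08, 28.03, 27.72, 26.20, 28.74),
  ExpectedEarning = c(54670, 48070, 50060, 61110, 52740, 55100)
)

# Min-max rescaling: the smallest mean maps to 0 and the largest to 1
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

normalized <- as.data.frame(lapply(means, rescale01))
rownames(normalized) <- roles

round(normalized, 2)
```

Rescaling each variable separately puts ages and salaries on the same axis, at the cost of exaggerating differences when the means are close together, as they are for age.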

One thing jumps out immediately: developing data scientists/engineers lead the pack for programming experience, current salary, and expected next salary. Beyond that, however, overplotting is again an issue, which makes it difficult to internalize other patterns in the data. I’ll fix that for all three multivariate plots in the final plots section.

Multivariate Analysis

Were there any interesting or surprising interactions between features?

Males and ethnic majorities dominate the $150k+ salary range, but an overall wage gap between genders and ethnicities for this dataset is not definitive when all the data points are plotted together. I thought the gaps would instantly be clear. More exploration needs to be done, and more will be done in the next section.


Final Plots

Plot One

Description One

For males vs. females, the density overlay suggests that a wage gap (i.e. males earning more than females) does not appear in this dataset. Though males do have the highest proportion of elite ($150k+) current and expected next salaries, a similar proportion of males and females sits in the top right quadrant, above both $50k lines. Females expect higher next salaries, while current salaries are similar.

An ethnicity-based wage gap does not appear either, based on the density overlay in the second plot. Current salary densities are nearly identical. Ethnic minorities appear optimistic about the changing diversity landscape, judging by their notable presence in the top left quadrant. This quadrant is where current salaries are below $50k, but expected next salaries are above $50k.

Higher dispersion exists for the majority demographic in both cases, with notable densities near the origin. The relationship between expected and current salary is much stronger for the minority demographic.
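The claim about the strength of the relationship can be checked with a per-group correlation. The sketch below uses base R on made-up toy values (not survey results), with assumed column names, and computes the Pearson correlation between current and expected salary within each group.

```r
# Toy data: made-up values for illustration only, not survey results.
# Column names (Gender, Income, ExpectedEarning) are assumed.
survey <- data.frame(
  Gender          = rep(c("female", "male"), each = 5),
  Income          = c(20000, 30000, 40000, 50000, 60000,
                      20000, 30000, 40000, 50000, 60000),
  ExpectedEarning = c(40000, 50000, 60000, 70000, 80000,
                      70000, 40000, 65000, 45000, 60000)
)

# Pearson correlation between current and expected salary, per group.
cor_by_group <- sapply(
  split(survey, survey$Gender),
  function(d) cor(d$Income, d$ExpectedEarning)
)
print(round(cor_by_group, 2))
```

A higher correlation for one group means its points hug a line more tightly, which is what lower dispersion looks like in the scatterplot.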

Perhaps new coders aren’t reflective of the working population in general, where data suggests that racial and gender wage gaps still exist in 2016.

Plot Two

Description Two

The dataset’s two main categorical variables - gender and citizenship by continent - convey the basic demographics of each job role of interest.

The majority of survey respondents are males and North Americans. Mobile development has the highest percentage of males. User experience design has the highest percentage of North Americans.

User experience designer is the most diverse role in terms of gender, with about the same number of males as females and the highest percentage of agender, genderqueer, and trans respondents.

Mobile developer is the most diverse role in terms of citizenship, with the lowest percentage of North Americans and relatively high percentages of Asians, South Americans, and Africans.
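The percentages quoted above are within-role shares. A minimal sketch of that calculation in base R follows; the column names are assumed and the values are made up for illustration. The same pattern applies to continent of citizenship.

```r
# Toy data: made-up values for illustration only, not survey results.
# Column names (JobRoleInterest, Gender) are assumed.
survey <- data.frame(
  JobRoleInterest = c(rep("UX Designer", 4), rep("Mobile Developer", 4)),
  Gender          = c("female", "female", "male", "male",
                      "male", "male", "male", "female")
)

# margin = 1 normalizes within each row, i.e. within each job role.
gender_pct <- 100 * prop.table(
  table(survey$JobRoleInterest, survey$Gender),
  margin = 1
)
print(round(gender_pct, 1))
```

Normalizing within each role matters because the roles have very different respondent counts, so raw counts would mostly reflect role popularity.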

Plot Three

Description Three

This faceted radar chart, where the normalized mean (between 0 and 1) for each numerical variable is plotted for each job role of interest, clarifies the differences between disciplines.

Developing data scientists/engineers make the most money, expect the most money for their next job, and have the most programming experience. Their polygon has the largest area.

Full-stack developers are relatively older and dedicate the most time to learning weekly. They also have a large polygon area.

Front-end developers are green in terms of programming experience and have the lowest salary expectations for their first job where they advertise their new web development skills. They also have relatively low current salaries. These three factors contribute to the smallest polygon area.

Mobile developers are the youngest and currently do not make much money. These characteristics are expected of the discipline with the highest proportion of respondents with no high school, some high school, or only a high school education. They have the second smallest polygon area.
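The polygon areas compared above can be computed directly from the normalized values. With n evenly spaced axes, each pair of neighboring spokes forms a triangle of area 0.5 * r1 * r2 * sin(2 * pi / n), and the polygon is the sum of those triangles. The function name radar_area and the example values below are hypothetical and made up for illustration.

```r
# Area of a radar-chart polygon with evenly spaced axes.
# r: values in plotting order, one per axis.
radar_area <- function(r) {
  n <- length(r)
  stopifnot(n >= 3)
  r_next <- c(r[-1], r[1])  # each value's clockwise neighbor, wrapping around
  0.5 * sin(2 * pi / n) * sum(r * r_next)
}

# Made-up normalized means on five axes, for illustration only.
role_a <- c(0.6, 0.5, 1.0, 1.0, 1.0)
role_b <- c(0.7, 0.4, 0.0, 0.2, 0.0)

print(radar_area(role_a))
print(radar_area(role_b))
```

One caveat with this measure: the area depends on the order of the axes, since only neighboring values are multiplied together. Two roles are only comparable by area when they share the same axis order.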


Summary

Developing data scientists and engineers differ slightly from new coders in general.

  • They have programmed for longer.
  • They want to work for established companies, rather than freelance or start their own.
  • They have a longer job search time horizon.
  • They use Coursera, edX, and Udacity more frequently.
  • They use bootcamps less frequently.
  • They have completed higher levels of education.
  • They come from a wider subject area background.
  • Fewer are currently working.
  • Fewer work in the tech industry.
  • They have more student debt.

The two datasets do share plenty of common trends. Demographics are similar. Most are willing to relocate. Most don’t use podcasts or attend events yet.

Older new coders are willing to take a pay cut when transitioning to a job where they advertise their new coding skills. Younger new coders intend to increase their earning potential by capitalizing on demand for coding.

Weekly hours dedicated to learning don’t differ much across genders and citizenships by continent. Next expected salary does, however. Most people aren’t replacing the traditional college/university route with full-time online education…yet. Those who are expect higher salaries.

Gender and continent distributions across job roles of interest vary. Females appear drawn to user experience design. Asians, South Americans, and Africans appear drawn to mobile development. School degree obtained does not vary much by discipline overall, though data science/engineering and mobile development stick out as the most and least seasoned in terms of education, respectively.

Developing data scientists/engineers have the highest current salaries, expect the highest next salaries, and have the most programming experience. Front-end developers are the oldest. Full-stack developers dedicate the most time to learning per week.

Mobile developers are the youngest and have the lowest current salaries. Front-end developers are the least experienced coders and expect the lowest next salaries. UX designers spend the fewest hours learning weekly.

The gender and racial wage gaps do not present themselves in this dataset. Perhaps new coders aren’t reflective of the working population in general.

Reflection

The successes of this exploration are largely due to the detailed design of the Free Code Camp survey.

The main struggle I encountered in this exploration was the lack of a main feature of interest, like the diamonds dataset’s price variable. It would be awesome if we could survey the same respondents in a decade or so. We could then combine career earnings and career satisfaction with the 2016 survey’s results to build a predictive model that estimates career success.

These are the people who are learning data science and engineering. It is clear that free, self-paced learning resources are important.